(54) Method and apparatus for fetching noncontiguous instructions in a data processing system



(57) A method and apparatus for obtaining non-con- 
tiguous blocks of instruction in a data processing system 
is disclosed. The apparatus comprises an instruction 
cache means for providing a first plurality of instructions 
and branch logic means for receiving the first plurality 
of instructions and for providing branch history informa-
tion about the first plurality of instructions. The appara-
tus further includes an auxiliary cache means for receiv-
ing a second plurality of instructions based upon the
branch history information. The auxiliary cache means
overlays at least a one of the second plurality of instruc- 
tions if there is a branch in the first plurality of instruc- 
tions and the branch is to the second plurality of instruc- 
tions. Thus the apparatus can use branch history infor- 
mation and an auxiliary cache to fetch multiple noncon- 
tiguous groups of instructions in a single cycle. Further- 
more, the technique allows noncontiguous fetching to 
be performed without requiring multiple levels of nested 
branch prediction logic to be evaluated in a single cycle. 
Description

[0001] The present invention relates to a system and 
method for fetching noncontiguous instructions in a data 
processing system. 

[0002] Superscalar processors employ aggressive 
techniques to exploit instruction-level parallelism. Wide 
dispatch and issue paths place an upper bound on peak 
instruction throughput. Large issue buffers are used to 
maintain a window of instructions necessary for detect- 
ing parallelism, and a large pool of physical registers 
provides destinations for all of the in-flight instructions 
issued from the window. To enable concurrent execution 
of instructions, the execution engine is composed of 
many parallel functional units. The fetch engine specu- 
lates past multiple branches in order to supply a contin- 
uous instruction stream to the window. 
[0003] The trend in superscalar design is to scale 
these techniques: wider dispatch/issue, larger windows, 
more physical registers, more functional units, and 
deeper speculation. To maintain this trend, it is important
to balance all parts of the processor, since any bottleneck
diminishes the benefit of aggressive techniques.
[0004] Instruction fetch performance depends on a 
number of factors. Instruction cache hit rate and branch 
prediction accuracy have been long recognized as im- 
portant problems in fetch performance and are well-re- 
searched areas. 

[0005] Because of branches and jumps, instructions 
to be fetched during any given cycle may not be in
contiguous cache locations. Hence, there must be adequate
paths and logic available to fetch and align noncontiguous
basic blocks and pass them down the pipelines. That is,
it is not enough for the instructions to be present in the
cache; it must also be possible to access them in parallel.

[0006] Modern microprocessors routinely use Branch 
History Tables and Branch Target Address Caches to 
improve their ability to efficiently fetch past branch in- 
structions. Branch History Tables and other prediction 
mechanisms allow a processor to fetch beyond a branch 
instruction before the outcome of the branch is known. 
Branch Target Address Caches allow a processor to 
speculatively fetch beyond a branch before the branch's 
target address has been computed. Both of these tech- 
niques use run-time history to speculatively predict 
which instructions should be fetched and eliminate 
"dead" cycles that might normally be wasted. Even with 
these techniques, current microprocessors are limited 
to fetching only contiguous instructions during a single 
clock cycle. 

[0007] As superscalar processors become more ag- 
gressive and attempt to execute many more instructions 
per cycle, they must also be able to fetch many more 
instructions per cycle. Frequent branch instructions can 
severely limit a processor's effective fetch bandwidth. 
Statistically, one of every four instructions is a branch 
instruction and over half of these branches are taken. A
processor with a wide fetch bandwidth, say 8 contiguous
instructions per cycle, could end up throwing away half
of the instructions that it fetches as much as half of the
time.
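
The rough arithmetic behind this estimate can be sketched as follows. The sketch assumes, purely for illustration, that each instruction is independently a taken branch with probability 1/4 x 1/2 = 1/8 and that every instruction after the first taken branch in a fetch group is discarded. The function names are hypothetical and do not come from the specification.

```python
def expected_useful(width: int, p_taken: float) -> float:
    """Expected number of instructions kept from one contiguous fetch group.

    The instruction in slot k (0-based) is kept only if none of the k
    instructions before it was a taken branch, i.e. with probability (1-p)^k.
    """
    return sum((1.0 - p_taken) ** k for k in range(width))


def prob_half_discarded(width: int, p_taken: float) -> float:
    """Probability that at least half of the group is thrown away, i.e. that
    a taken branch occurs among the first width // 2 instructions."""
    return 1.0 - (1.0 - p_taken) ** (width // 2)


# Assumed rates from the text: one in four instructions is a branch and
# about half of those branches are taken.
P_TAKEN = 0.25 * 0.5

print(round(expected_useful(8, P_TAKEN), 2))
print(round(prob_half_discarded(8, P_TAKEN), 2))
```

Under these assumptions, an 8-wide contiguous fetch keeps about 5.25 instructions on average, and at least half of the group is discarded in roughly 41% of fetches.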

[0008] High performance superscalar processor
organizations divide naturally into an instruction fetch
mechanism and an instruction execution mechanism.
The fetch and execution mechanisms are separated by
instruction issue buffer(s), for example, queues,
reservation stations, etc. Conceptually, the instruction fetch
mechanism acts as a "producer" which fetches, decodes,
and places instructions into the buffer. The instruction
execution engine is the "consumer" which removes
instructions from the buffer and executes them, subject
to data dependence and resource constraints.
Control dependences (branches and jumps) provide a
feedback mechanism between the producer and consumer.

[0009] Previous designs use a conventional instruction
cache, containing a static form of the program, to
work with. Every cycle, instructions from noncontiguous
locations must be fetched from the instruction cache and
assembled into the predicted dynamic sequence. There
are problems with this approach:

[0010] Pointers to all of the noncontiguous instruction
blocks must be generated before fetching can begin.
This implies a level of indirection, through some form of
branch target table (branch target buffer, branch address
cache, etc.), which translates into an additional
pipeline stage before the instruction cache.

[0011] The instruction cache must support simultaneous
access to multiple, noncontiguous cache lines. This
forces the cache to be multiported; if multiporting is done
through interleaving, bank conflicts are suffered.

[0012] After fetching the noncontiguous instructions
from the cache, they must be assembled into the dynamic
sequence. Instructions must be shifted and
aligned to make them appear contiguous to the decoder.
This most likely translates into an additional pipeline
stage after the instruction cache.

[0013] A trace cache approach avoids these problems
by caching dynamic sequences themselves, ready
for the decoder. If the predicted dynamic sequence exists
in the trace cache, it does not have to be recreated
on the fly from the instruction cache's static representation.
In particular, no additional stages before or after
the instruction cache are needed for fetching noncontiguous
instructions. The stages do exist, but not on the
critical path of the fetch unit; rather, they are on the fill
side of the trace cache. The cost of this approach is
redundant instruction storage: the same instructions must
reside in both the primary cache and the trace cache, and
there might even be redundancy among lines in the trace
cache. Accordingly, with a trace cache approach,
several instructions are grouped together based upon a
most likely path and are then stored together in the
trace cache. This system requires a complex mechanism
to pack and cache instruction segments.
[0014] Accordingly, a need exists for a technique for
improving the overall throughput of a superscalar proc- 
essor. More particularly, what is needed is a system and 
method for efficiently fetching noncontiguous instruc- 
tions in such a processor. 

[0015] A method and system for obtaining non-con- 
tiguous blocks of instruction in a data processing system 
is disclosed. In a first aspect, apparatus for fetching non- 
contiguous blocks of instructions in a data processing 
system is disclosed. The apparatus comprises an in- 
struction cache means for providing a first plurality of 
instructions and branch logic means for receiving the 
first plurality of instructions and for providing branch his- 
tory information about the first plurality of instructions. 
The apparatus further includes an auxiliary cache 
means for receiving a second plurality of instructions 
based upon the branch history information. The auxiliary 
cache means overlays at least a one of the second plu- 
rality of instructions if there is a branch in the first plu- 
rality of instructions and the branch is to the second plu- 
rality of instructions. 

[0016] In a second aspect, a method for obtaining 
noncontiguous blocks of instruction comprises storing a 
first plurality of instructions in a first cache and fetching 
the first plurality of instructions in parallel with a fetch of 
a second plurality of instructions within a second cache. 
In the present invention, the number of the second plurality
of instructions is greater than the number of the
first plurality of instructions. This second aspect includes
replacing a portion of the second plurality of instructions 
with at least one of the first plurality of instructions based 
upon a branch history information of the data processing 
system. 

[0017] The above-described present invention allows
a processor to use branch history information and an 
auxiliary cache to fetch multiple noncontiguous groups 
of instructions in a single cycle. Furthermore, the tech- 
nique allows noncontiguous fetching to be performed 
without requiring multiple levels of nested branch pre- 
diction logic to be evaluated in a single cycle. 
[0018] An embodiment of the invention will now be
described, by way of example only, with reference to the
accompanying drawings in which: 

Figure 1 is a block diagram of a superscalar proc- 
essor; 

Figure 2A is a block diagram of a conventional 
mechanism within a processor for fetching noncon- 
tiguous instructions; 

Figure 2B is a block diagram of an instruction cache 
and branch target address cache entry; 

Figure 3 is a flow chart of a branch prediction
algorithm for the conventional mechanism of Figure 2A;

Figure 4 is a block diagram of a mechanism within
a processor for fetching noncontiguous instructions
in a single cycle in accordance with the embodiment
of the present invention;

Figure 5 is a flow chart of the branch prediction
algorithm for the noncontiguous instruction fetch
mechanism of Figure 4; and

Figure 6 is a table that illustrates the flow of
instructions when utilizing the branch prediction algorithm
of Figure 5.

[0019] Figure 1 is a block diagram of a superscalar
processor 10. As shown, the superscalar processor 10
typically includes a system bus 11 connected to a bus
interface unit ("BIU") 12. BIU 12 controls the transfer of
information between processor 10 and system bus 11.
BIU 12 is connected to an instruction cache 14 and to a
data cache 16 of processor 10. Instruction cache 14
outputs instructions to a sequencer unit 18. In response to
such instructions from instruction cache 14, sequencer
unit 18 selectively outputs instructions to other execution
circuitry of processor 10.

[0020] In addition to sequencer unit 18 which includes
execution units of a dispatch unit 46 and a completion
unit 48, in the preferred embodiment the execution
circuitry of processor 10 includes multiple execution units,
namely a branch unit 20, a fixed point unit A ("FXUA")
22, a fixed point unit B ("FXUB") 24, a complex fixed
point unit ("CFXU") 26, a load/store unit ("LSU") 28 and
a floating point unit ("FPU") 30. FXUA 22, FXUB 24,
CFXU 26 and LSU 28 input their source operand information
from general purpose architectural registers
("GPRs") 32 and fixed point rename buffers 34. Moreover,
FXUA 22 and FXUB 24 input a "carry bit" from a
carry bit ("CA") register 42. FXUA 22, FXUB 24, CFXU
26 and LSU 28 output results (destination operand
information) of their operations for storage at selected
entries in fixed point rename buffers 34. Also, CFXU 26
inputs and outputs source operand information and
destination operand information to and from special
purpose registers ("SPRs") 40.

[0021] FPU 30 inputs its source operand information
from floating point architectural registers ("FPRs") 36
and floating point rename buffers 38. FPU 30 outputs
results (destination operand information) of its operation
for storage at selected entries in floating point rename
buffers 38.

Processes

[0022] The processor 10 is usually implemented with
a large number of state machines that control relatively
independent processes. It may be thought of as a complex
parallel algorithm involving multiple concurrent
processes.

Instruction Fetch

[0023] This process provides a continuous stream of
instructions from the instruction cache and utilizes the
branch target address cache (BTAC) as a fetch prediction
mechanism.

Branch Prediction 

[0024] This process identifies and predicts branches, 
verifies that the appropriate instructions were fetched, 
updates information, and places information about 
speculative branches into a Branch Queue. 

Branch Resolution 

[0025] This process checks that the prediction 
matched the actual branch result and makes corrections 
if a misprediction occurred. 

Branch Completion 

[0026] This process writes information for completed 
branches into the BHT and removes entries from the 
Branch Queue. 

[0027] The present embodiment relates generally to
the fetch cycle and the ability to fetch noncontiguous
instructions. Figure 2A shows in system 100 the hardware
mechanisms for the conventional technique for fetching
groups of instructions. In this embodiment, eight
instructions are shown being fetched at a time. In addition, the
present invention is discussed with reference to
instructions being four bytes in length. However, one of
ordinary skill in the art readily recognizes that any number
of instructions can be fetched at a time, that the
instructions can be of any length, and that such number and
length would be within the scope of the present invention.
[0028] Referring back to Figure 2A, a fetch address
signal 102 is provided to a branch history table (BHT) 104,
an instruction cache 106, a branch target address cache
(BTAC) 108, a directory for the instruction cache
(INST Dir) 110, and a +32 counter 111. To provide a fuller
understanding of the operation of the above-identified
mechanism, particularly as it relates to the instruction
cache 106, a conventional instruction cache entry is
described below.
[0029] Figure 2B shows a simple organization for an
instruction cache and BTAC entry 200 as required by an
instruction fetcher (this entry may include other
information that is not important for this discussion).
Figure 2B shows fetch information 201, which
includes a sample address tag 202, a successor index
204, and branch block index entries 206 for a code
sequence, assuming a 64-Kbyte, direct-mapped cache
and the indicated instruction addresses. For this example,
the cache entry holds four instructions 208, 210, 212
and 214. The entry also contains instruction-fetch
information. The fetch information also includes two
additional fields (not shown) used by the instruction fetcher.
[0030] The successor index field 204 indicates both 
the next cache block predicted to be fetched and the first 
instruction within this next block predicted to be execut- 
ed. The successor index field 204 does not specify a full 
instruction address, but is of sufficient size to select any 
instruction within the cache. For example, a 64-Kbyte, 
direct-mapped cache requires a 14-bit successor index 
if all instructions are 32 bits in length (12 bits to address 
the cache block and 2 bits to address the instruction in 
the block if the block size is four words). 
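
The 14-bit figure can be checked with a short calculation. The sketch below is illustrative only; the function name and parameters are hypothetical, and it assumes power-of-two cache and block sizes.

```python
from typing import Tuple


def successor_index_bits(cache_bytes: int, block_words: int,
                         word_bytes: int = 4) -> Tuple[int, int]:
    """Return (bits to select a cache block, bits to select a word in it)
    for a direct-mapped cache with power-of-two sizes."""
    block_bytes = block_words * word_bytes
    num_blocks = cache_bytes // block_bytes
    block_bits = (num_blocks - 1).bit_length()
    word_bits = (block_words - 1).bit_length()
    return block_bits, word_bits


# 64-Kbyte direct-mapped cache, four 32-bit instructions per block.
block_bits, word_bits = successor_index_bits(64 * 1024, 4)
print(block_bits, word_bits, block_bits + word_bits)
```

With 16-byte blocks the cache holds 4096 blocks, so 12 bits select the block and 2 bits select the instruction within it, for a total of 14 bits.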
[0031] In a preferred embodiment, the branch block 
index field 206 indicates the location of a branch point 
within the corresponding instruction block. Instructions 
beyond the branch point are predicted not to be execut- 
ed. 

[0032] Referring back to Figure 2A, the BHT 104 also
receives a BHT update signal and outputs a read signal.
The read signal from the BHT 104 is provided to branch
logic 116. The instruction cache 106 receives a write
signal from an outside source, such as an L2 cache. The
instruction cache 106 outputs eight instructions
(instruction group 0) to the branch logic 116. An address 0 signal
is provided directly to the branch logic 116. The branch
logic 116 provides an override address signal to
multiplexer 120. Multiplexer 120 also receives signals from
the +32 counter 111 and the output of BTAC 108. An address 1
signal is provided from BTAC 108 to branch logic 116.
The instruction directory 110 provides a hit signal to the
branch logic 116. The branch logic 116 also receives the
branch outcome signal, provides branch information to
a branch queue 126, outputs a BTAC address 128,
and provides valid instructions 124. This type of
mechanism is capable of fetching eight contiguous
instructions per cycle, but would only use instructions up to the
first predicted taken branch in the group. To explain this
in more detail, refer to the following discussion in
combination with the accompanying figures.
[0033] As before described, there are several proc- 
esses associated with the fetching of groups of instruc- 
tions. The present invention is related to an improve- 
ment in the branch prediction algorithm and an associ- 
ated modification to the conventional fetch mechanism 
of Figure 2A. 

[0034] To further illustrate the problems associated 
with the fetching of noncontiguous instructions with re- 
gard to the conventional mechanisms of Figure 2A, refer 
now to Figure 3. 

[0035] Figure 3 is a flow chart of a branch prediction
algorithm for the conventional mechanism of Figure 2A.
Referring to Figures 2A and 3 together, first it is
determined whether valid instructions are found in the
instruction cache, via step 302. If there are no valid instructions
found in the instruction cache, then all fetched
instructions are invalidated and the miss handler is initiated,
via step 304. However, if there are valid instructions
found in the instruction cache, then branches are
identified, target addresses are computed, and the branches
are predicted taken or not taken based on the branch logic
116 and the branch history table 104, via step 306. Thereafter
it is determined whether there is a predicted taken
branch in instruction group 0 (the first group of
instructions), via step 308. If there is a predicted taken branch,
then all subsequent instructions are invalidated, via step
310. Then it is determined if the address 1 (address of
the second group of instructions) from BTAC 108 is
equal to target 0 of the instruction directory, via step 312.
If the answer is yes, then the branch addresses and
prediction information for all branches are stored
into the branch queue 126, via step 314. If, on
the other hand, address 1 is not equal to target 0, then
the instructions that have been fetched in the next cycle
are invalidated and the override address is equal to
target 0, via step 316. Thereafter, the BTAC address is
updated to equal target 0, via step 318, and then the
branch addresses and prediction information are stored
into the branch queue via step 314. If, on the other hand,
at step 308 there are no predicted taken branches in
group 0, then it is determined if address 1 is equal to
address 0 plus 32. If the answer is yes, then return to
step 314. If, on the other hand, the answer is no, then
all instruction groups fetched for the next cycle are
invalidated and the override address equals address 0 +
32, via step 322. Thereafter, the BTAC address is
updated to equal invalid, via step 324, and then return to
step 314.
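
The decision chain of Figure 3 can be summarized in a short sketch. This is a simplified software model under stated assumptions, not the hardware itself: the names `FetchOutcome` and `conventional_predict` are hypothetical, the branch queue update of step 314 is omitted, and the predicted taken branch in group 0 is passed in as a slot number and target rather than derived from a BHT.

```python
from dataclasses import dataclass
from typing import Optional, Union

GROUP_BYTES = 32     # eight 4-byte instructions per fetch group
INVALID = "invalid"  # stands in for an invalidated BTAC entry


@dataclass
class FetchOutcome:
    valid_count: int                           # instructions of group 0 kept
    override_address: Optional[int] = None     # redirect for the next fetch
    btac_update: Union[int, str, None] = None  # None leaves the BTAC unchanged


def conventional_predict(address0: int, address1: int, hit: bool,
                         taken_slot: Optional[int] = None,
                         target0: Optional[int] = None,
                         width: int = 8) -> FetchOutcome:
    if not hit:                                      # steps 302, 304
        return FetchOutcome(valid_count=0)           # miss handler takes over
    if taken_slot is not None:                       # step 308
        kept = taken_slot + 1                        # step 310
        if address1 == target0:                      # step 312
            return FetchOutcome(kept)                # straight to step 314
        return FetchOutcome(kept, target0, target0)  # steps 316, 318
    sequential = address0 + GROUP_BYTES
    if address1 == sequential:                       # BTAC guessed fall-through
        return FetchOutcome(width)
    return FetchOutcome(width, sequential, INVALID)  # steps 322, 324


# A taken branch in slot 2 to 0x100, while the BTAC had guessed 0x020.
print(conventional_predict(0x000, 0x020, True, taken_slot=2, target0=0x100))
```

In every path of this model, at most the instructions up to the first predicted taken branch of a single contiguous group are delivered.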

[0036] This algorithm of Figure 3 does not allow for 
fetching non-contiguous instructions in a single cycle. 
This prediction algorithm always requires that when a
predicted taken branch is encountered, only the instructions
up to that branch can be utilized. As mentioned above,
there are mechanisms, such as a trace cache,
that retrieve noncontiguous instructions in a single
cycle, but they add complexity and cost to the system.
[0037] The present invention overcomes this problem 
by providing an auxiliary cache and an overlaying tech- 
nique that utilizes the auxiliary cache to fetch noncon- 
tiguous instructions in a single cycle. 
[0038] In the present embodiment, three major hard- 
ware mechanisms are required for this technique: 

(1) a Standard Instruction Cache (or other memory 
source), 

(2) a Branch Target Address Cache, and 

(3) an auxiliary cache. 

[0039] A Standard Instruction Cache and a Branch 
Target Address Cache are commonly used in most mi- 
croprocessors and may be used without modification for 
this technique. The auxiliary cache is a new hardware 
mechanism that contains multiple entries with one or
more instructions and an associated address. The auxiliary
cache may be highly associative and relatively
small compared to the main instruction cache. 
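
One way to picture such a structure is the minimal sketch below. It models a small, fully associative cache whose entries hold an address and the instructions at that address. The class and method names are hypothetical, keying entries by the full fetch index is a simplification, and the least-recently-used replacement policy is an assumption; the specification does not name a replacement policy.

```python
from collections import OrderedDict
from typing import List, NamedTuple, Optional


class AuxEntry(NamedTuple):
    address: int             # address of the stored instructions
    instructions: List[str]  # one or more instructions at that address


class AuxiliaryCache:
    """Small fully associative cache with LRU replacement (assumed policy)."""

    def __init__(self, num_entries: int = 4) -> None:
        self.num_entries = num_entries
        self._entries: "OrderedDict[int, AuxEntry]" = OrderedDict()

    def lookup(self, fetch_index: int) -> Optional[AuxEntry]:
        entry = self._entries.get(fetch_index)
        if entry is not None:
            self._entries.move_to_end(fetch_index)  # mark most recently used
        return entry

    def store(self, fetch_index: int, address: int,
              instructions: List[str]) -> None:
        if (fetch_index not in self._entries
                and len(self._entries) >= self.num_entries):
            self._entries.popitem(last=False)        # evict least recently used
        self._entries[fetch_index] = AuxEntry(address, list(instructions))
        self._entries.move_to_end(fetch_index)


aux = AuxiliaryCache(num_entries=2)
aux.store(0x000, 0x100, ["ld", "add", "cmp", "bc"])
print(aux.lookup(0x000))
```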



[0040] The present embodiment operates generally in 
the following manner: 

1. If a branch instruction in the first instruction group
is considered strongly taken (based on branch history
or other information) and no instructions were
provided from the auxiliary cache, use the fetch
index to add the branch's target address and one or
more instructions at that address to the auxiliary
cache. Also, an appropriate sequential address is
provided as needed (e.g., branch target plus 16
bytes) to the BTAC.

2. Else, if a branch instruction in the second
instruction group is considered strongly taken, use the
fetch index to add the branch's target address to the
BTAC.

3. Else, if no branch instruction in either instruction
group is considered strongly taken, use the fetch
index to clear the BTAC and default to an
appropriate sequential address.
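
The three rules above can be sketched as a single update function. This is a minimal model under stated assumptions: plain dictionaries stand in for the auxiliary cache and the BTAC, `read_instructions` is a hypothetical callback that returns the instructions at an address, and a target of `None` means that no branch in that group is considered strongly taken.

```python
from typing import Callable, Dict, List, Optional, Tuple

TARGET_GROUP_BYTES = 16  # "branch target plus 16 bytes" from rule 1


def update_predictors(fetch_index: int,
                      first_target: Optional[int],
                      second_target: Optional[int],
                      aux_provided: bool,
                      aux: Dict[int, Tuple[int, List[str]]],
                      btac: Dict[int, int],
                      read_instructions: Callable[[int], List[str]]) -> str:
    # Rule 1: strongly taken branch in the first group, nothing from the aux cache.
    if first_target is not None and not aux_provided:
        aux[fetch_index] = (first_target, read_instructions(first_target))
        btac[fetch_index] = first_target + TARGET_GROUP_BYTES
        return "rule 1"
    # Rule 2: strongly taken branch in the second group.
    if second_target is not None:
        btac[fetch_index] = second_target
        return "rule 2"
    # Rule 3: no strongly taken branch in either group.
    if first_target is None:
        btac.pop(fetch_index, None)  # cleared entry means "fetch sequentially"
        return "rule 3"
    return "no update"


aux_cache: Dict[int, Tuple[int, List[str]]] = {}
btac_table: Dict[int, int] = {}
print(update_predictors(0x000, 0x100, None, False, aux_cache, btac_table,
                        lambda addr: ["ld", "add", "cmp", "bc"]))
print(hex(btac_table[0x000]))
```

For a first-group branch to 0x100 with an empty auxiliary cache, rule 1 applies: the target and its instructions go to the auxiliary cache and 0x110 goes to the BTAC.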

[0041] To more particularly describe the features and
operation of the present embodiment, refer now to the
following discussion in conjunction with the
accompanying figures.
[0042] Figure 4 is a block diagram of a mechanism
400 within a processor for fetching noncontiguous
instructions in a single cycle in accordance with the
present invention. The elements of the mechanism 400
are similar to many of the elements presently in
mechanism 100. Those elements that are similar have been
given similar reference numerals. As has been before
mentioned, the key element that is different is the
addition of the auxiliary cache 415 and its directory 417.
[0043] In addition, as is seen, there are instruction
groups 0 and 1, and there are four multiplexers
425 which allow for the overlaying of the instructions
from the auxiliary cache 417 on the instruction group 1
from the instruction cache 106' based on branch history
information derived from BHT 104' and branch logic
116'. Similarly, the auxiliary directory also overlays its
address over the address 1 signal of the +16 counter
421 based upon the branch history information. In
addition, FTAC 419 also provides an address 2 signal rather
than the address 1 signal provided by the BTAC 108' of
Figure 2A. Accordingly, as has been before mentioned,
through the addition of the auxiliary cache 415 and the
use of it and the auxiliary directory 417, it is possible
now to accumulate information to allow for fetching of
noncontiguous instructions. To further describe this
feature in a more detailed manner, refer now to Figure 5.
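
The overlay performed by the four multiplexers can be pictured with the sketch below. It is a simplified model: the function name is hypothetical, an auxiliary hit is assumed to supply all four slots of group 1, and the select signal is reduced to "the auxiliary directory hit", whereas the mechanism described above derives it from branch history information.

```python
from typing import List, Optional, Tuple

GROUP_SIZE = 4    # instructions per group
GROUP_BYTES = 16  # four 4-byte instructions


def overlay(icache_line: List[str], address0: int,
            aux_entry: Optional[Tuple[int, List[str]]]
            ) -> Tuple[List[str], List[str], int]:
    """Return (group 0, group 1, address 1) after the optional overlay."""
    group0 = icache_line[:GROUP_SIZE]
    group1 = icache_line[GROUP_SIZE:2 * GROUP_SIZE]
    address1 = address0 + GROUP_BYTES          # default from the +16 counter
    if aux_entry is not None:
        aux_address, aux_instructions = aux_entry
        # One 2-to-1 multiplexer per slot: take the auxiliary instruction.
        group1 = [aux for aux, _seq in zip(aux_instructions, group1)]
        address1 = aux_address                 # directory overlays address 1
    return group0, group1, address1


line = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
print(overlay(line, 0x000, None))
print(overlay(line, 0x000, (0x100, ["t0", "t1", "t2", "t3"])))
```

Without an auxiliary hit, the eight instructions are contiguous and address 1 is address 0 + 16. With a hit, group 1 and address 1 both come from the auxiliary side, so the two groups delivered in the cycle are noncontiguous.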
[0044] Figure 5 is a flow chart of the branch prediction
algorithm for the noncontiguous instruction fetch
mechanism of Figure 4. Referring now to Figures 4 and 5
together, first it is determined whether there are valid
instructions stored in the instruction cache 106', via step
502. If there were no valid instructions found in the
instruction cache 106', then all instructions are invalidated
and a miss handler is initiated, via step 504. If, on the
other hand, there are valid instructions in the instruction
cache 106', it is next determined whether there were valid
instructions found in the auxiliary cache 417, via step
506. If there were valid instructions found in the auxiliary
cache 417, then the instructions from the auxiliary cache
417 are overlaid on the instruction group from the
instruction cache 106', via step 508. If, on the other hand,
there were no valid instructions found in the auxiliary
cache, then all the instructions from the instruction
cache are retained, via step 510.
[0045] From either of the steps 508 and 510, next
branches are identified, target addresses are computed,
and they are predicted taken or not taken, via step 512
based upon the branch logic 116' and the BHT 104'
operating in a conventional manner. Thereafter, it is
determined if there is a predicted taken branch in
instruction group 0, via step 514. If there is a predicted
taken branch in instruction group 0, then the subsequent
instructions are invalidated, via step 516.
[0046] Next, it is determined whether address 1 is
equal to the target address 0 of the branch, via step 518.
If the answer is yes, then it is determined whether there
is a predicted taken branch in group 1 or in the next
group of instructions, via step 520. If the answer to that
question is yes, then the subsequent instructions in
instruction group 1 are invalidated, via step 522. Then it
is determined whether address 2 is equal to target 1 of
the branch, via step 524. If the answer to that is yes,
then the branch queue stores the branch addresses and
prediction information for all the branches, via step 526.
If the answer is no, then the next cycle group is
invalidated and the override address equals target 1, via step
528. Thereafter, the FTAC address is updated to equal
target 1, via step 530, and return to step 526.
[0047] If address 1 is not equal to target 0 via step
518, then all the instructions in group 1 and the next
cycle groups are invalidated, the override address
equals target 0, and the mechanism prepares to save the
next group in the auxiliary cache, via step 521. Thereafter,
the auxiliary address is updated to equal target 0 and the
FTAC address equals target 0 + 16, via step 523.
[0048] If, on the other hand, there is no predicted
taken branch in group 1, via step 520, then it is
determined if address 2 is equal to address 1 plus 16,
via step 532. If the answer is yes, then return to step
526. If the answer is no, then all the next cycle groups
are invalidated and the override address is set to equal
address 1 + 16, via step 534. Thereafter the FTAC
address is updated to equal address 1 + 16, via step 536,
and then return to step 526.

[0049] Returning now to step 514, if there are no
predicted taken branches in group 0, then it is determined
if address 1 equals address 0 + 16. If address 1 is equal
to address 0 + 16, then return to step 520 and proceed
through the steps based on that decision chain. If, on
the other hand, address 1 does not equal address 0 + 16,
then all instructions in group 1 and the next cycle
groups are invalidated and the override address equals
address 0 + 16, via step 540. Thereafter,
the auxiliary address is updated to equal invalid
and the FTAC address is updated to equal invalid, via
step 542. Then return to step 526. Accordingly, through
this branch prediction process, the system can accumulate
branch history information in a manner that allows
instructions from the auxiliary cache to be overlaid so as
to efficiently fetch noncontiguous instructions. To more
clearly describe the operation in the context of a particular
example, refer now to Figure 6.
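
The checks of steps 514 through 542 can be condensed into the sketch below. It is an illustrative model only: the names are hypothetical, the inputs are taken after the overlay of steps 506 to 510, a target of `None` means that no predicted taken branch exists in that group, and the branch queue update of step 526 is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Union

GROUP_BYTES = 16     # four 4-byte instructions per group
INVALID = "invalid"  # stands in for an invalidated entry


@dataclass
class Result:
    group1_valid: bool = True
    override_address: Optional[int] = None     # redirect for the next fetch
    aux_address: Union[int, str, None] = None  # None leaves the entry unchanged
    ftac_address: Union[int, str, None] = None


def noncontiguous_predict(address0: int, address1: int, address2: int,
                          target0: Optional[int],
                          target1: Optional[int]) -> Result:
    # Steps 514 and 518, or the sequential check when group 0 has no taken branch.
    expected1 = target0 if target0 is not None else address0 + GROUP_BYTES
    if address1 != expected1:
        if target0 is not None:                           # steps 521, 523
            return Result(False, target0, target0, target0 + GROUP_BYTES)
        return Result(False, expected1, INVALID, INVALID)  # steps 540, 542
    # Steps 520 to 536: group 1 was correct, now check the next fetch address.
    expected2 = target1 if target1 is not None else address1 + GROUP_BYTES
    if address2 != expected2:                 # steps 528, 530 or 534, 536
        return Result(True, expected2, None, expected2)
    return Result()                           # nothing to correct


# First pass: group 0 branches to 0x100, but group 1 was fetched from 0x010.
print(noncontiguous_predict(0x000, 0x010, 0x020, target0=0x100, target1=None))
```

On this first pass the model requests an override to 0x100, an auxiliary address of 0x100 and an FTAC address of 0x110. Once those entries are in place, a later fetch at 0x000 with address 1 equal to 0x100 passes every check and both groups are used.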

[0050] Figure 6 is an example 600 that illustrates the 
flow of instructions when utilizing the branch prediction 
algorithm of Figure 5. The example 600 shown in Figure 
6 illustrates a sequence of fetches 602 performed on 
consecutive cycles in accordance with the present
invention for a program segment 604. Note that all the
addresses are described using a hexadecimal format
(base 16). Asterisks in the example 600 indicate fetched
instructions that were invalidated from the instruction 
stream. 

[0051] As is seen, the program segment comprises a 
plurality of basic blocks 606, 608, 610 and 612. Each of 
the basic blocks 606-612 begins with a load instruction
and end with a branch instruction. The basic blocks are 
used in conjunction with the present invention to allow 
for non-contiguous instructions to be obtained in a single 
cycle. 

[0052] Through the use of the branch prediction
algorithm of Figure 5 in conjunction with the hardware
mechanisms of Figure 4 to accumulate branch history
information, noncontiguous instructions of Figure 6 can
be obtained in a single cycle.

[0053] To illustrate the method for obtaining
instructions in a single cycle, refer to Figures 4, 5 and 6. As
has been indicated, at cycle 000 there are eight
instructions being provided. It is assumed that the auxiliary
cache 417 initially contains no instructions. At this time
valid instructions are found in the instruction cache
(step 502); it is then determined whether valid
instructions are found in the auxiliary cache (step 506), and
the answer is no. In that event, all the
instructions from the instruction cache are retained.
At that point, branches are identified, target addresses
are computed, and the branches are predicted taken or
not taken, via step 512. It is known that the target for the
branch in the first set of instructions is at 0x100. It is now
determined whether there is a predicted taken branch in
group 0, and the answer is yes: it is the third instruction
at address 000. Through the branch prediction process
the instruction at address 100 is provided to the auxiliary
cache and the address is stored in the auxiliary directory.
Also, through the branch prediction process address 110
would be stored in the BTAC 108 (step 530).
[0054] Then the next basic block 608 is used at cycle
003 to load the instructions at the target address 0x100.
The last instruction of basic block 608 is a branch to
address 200. Address 100, as before mentioned,
is then fetched and similar information is accumulated
for it in the auxiliary cache, auxiliary directory and BTAC.
Accordingly, information is accumulated in cycles
003-007 until, as is seen in cycle 008, two noncontiguous
instructions (0x000 and 0x100) are retrieved in a
single cycle.

[0055] This branch prediction process is repeated 
again, via basic blocks 610 and 612, wherein non-con- 
tiguous instructions are fetched in cycles 020 through 
024. Accordingly, as is seen in this example, after 
branch history information is accumulated in a sufficient 
manner, then noncontiguous instructions can be ob- 
tained in a single cycle. This process can be repeated, 
particularly in those instances where instructions recur, 
such that a majority of noncontiguous instructions can 
be retrieved in a single cycle. This is accomplished 
through the use of the auxiliary cache and the miscella- 
neous branch logic cooperating with the branch history 
tables while utilizing the branch prediction process in ac- 
cordance with the present invention. 
[0056] Several other techniques, such as trace caching
and multi-level branch predictors, have been proposed
for allowing a processor to fetch noncontiguous
instructions in a single cycle. The auxiliary cache and
instruction overlay technique described above is simpler
than other techniques, but as effective.
[0057] Although the present invention has been de- 
scribed in accordance with the embodiments shown, 
one of ordinary skill in the art will readily recognize that 
there could be variations to the embodiments and those 
variations would be within the spirit and scope of the 
present invention. Accordingly, many modifications may 
be made by one of ordinary skill in the art without de- 
parting from the scope of the appended claims. 



Claims 

1. Apparatus for fetching noncontiguous blocks of
instructions in a data processing system; the system
comprising:

an instruction cache means for providing a first
plurality of instructions;

branch logic means for receiving the first plurality
of instructions and for providing branch
history information about the first plurality of
instructions; and

an auxiliary cache means for receiving a second
plurality of instructions based upon the
branch history information; the auxiliary cache
means overlaying at least a one of the second
plurality of instructions if there is a branch in the
first plurality of instructions and the branch is to
the second plurality of instructions.

2. Apparatus as claimed in claim 1 in which the auxiliary
cache means comprises an auxiliary cache and
an auxiliary directory.

3. Apparatus as claimed in claim 1 or claim 2 in which
the first plurality of instructions comprises two
blocks of instructions.

4. Apparatus as claimed in any preceding claim in
which the second plurality of instructions comprises
one block of instructions.

5. Apparatus as claimed in any preceding claim which
further comprises a branch address target cache
coupled to the branch logic.

6. A method for obtaining non-contiguous blocks of
instructions in a data processing system; the method
comprising the steps of:

(a) storing a first plurality of instructions in a first
cache;

(b) fetching the first plurality of instructions in
parallel with a fetch of a second plurality of
instructions within a second cache, the number
of the second plurality of instructions being
greater than the number of the first plurality of
instructions; and

(c) replacing a portion of the second plurality of
instructions with at least one of the first plurality
of instructions based upon a branch history
information of the data processing system.

7. The method of claim 6 wherein the first cache
comprises an instruction cache and the second cache
comprises an auxiliary cache.

8. The method of claim 7 in which the auxiliary cache
includes an auxiliary directory.

9. The method of any of claims 6 to 8 in which the second
plurality of instructions comprises two blocks of
instructions.

10. The method of any of claims 6 to 9 in which the first
plurality of instructions comprises one block of
instructions.